SoM - Komzet - Analysis

Author: HSRT & HTWG     Last update: 31/05/2021

Content

  1. **Data loading**
  1. **General statistical description - Funktionsbereich (All features)**
  1. **Statistical description per area - Funktionsbereich**
  1. **Attribute selection**
  1. **Conclusion**

The objective of this data analysis is to understand the underlying patterns of the dataset containing the technology products in order to obtain a reduced set of features (attributes of the table). This reduced set of features will help us map the users' responses to the technological devices, so that a recommendation can finally be generated.

1. Data loading

1.1 Libraries

In this section the libraries used for the analysis are loaded.
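A minimal sketch of the libraries such an analysis typically relies on (assumed here; the notebook's actual imports may differ):

```python
import numpy as np                 # numerical operations
import pandas as pd                # dataframes for the Excel tables
import matplotlib.pyplot as plt    # bar charts of the frequencies
from scipy import stats            # Chi2 test, entropy, Kendall's tau
```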

1.2 Data import

The Excel table with the technology products is imported as a dataframe.
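The import step can be sketched as follows; the workbook and sheet names are hypothetical, and a tiny stand-in table is round-tripped through Excel so the snippet is self-contained:

```python
import pandas as pd

# Build a tiny stand-in table and write it to an Excel workbook
# (file and sheet names are illustrative, not the project's actual ones).
demo = pd.DataFrame({"Baumaßnahmen": [3, 5], "Zuverlässigkeit": [4, 2]})
demo.to_excel("technologieprodukte.xlsx",
              sheet_name="Erste Aufstellung", index=False)

# Import the sheet as a dataframe, as done for both Aufstellungen.
df = pd.read_excel("technologieprodukte.xlsx",
                   sheet_name="Erste Aufstellung")
print(df.shape)  # (2, 2)
```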

Erste Aufstellung

Zweite Aufstellung

1.3 Dataframe description & preprocessing

This section describes the characteristics of the dataset.

As the criterion for filling in the table was to rate the importance of each attribute with a value ranging from 1 to 5, most of the attributes are ordinal. There are also some categorical attributes. Therefore, we divided the attributes into these two types.

For the analysis, it is important not to have non-numeric or missing values. If any exist, they are removed. In the next two cells, a search for missing values takes place and attributes with non-numeric or missing values are eliminated; as a consequence, the number of attributes is reduced.
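A sketch of this cleaning step on a toy frame (column names are illustrative): non-numeric columns are dropped first, then any remaining attribute with missing values is eliminated, which reduces the number of attributes.

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the issue: one attribute has a missing value,
# one attribute is non-numeric.
df = pd.DataFrame({
    "Zuverlässigkeit": [4, 5, 3],
    "Technologiealter": [2, None, 1],
    "Kommentar": ["ok", "?", "gut"],
})

# Keep only numeric attributes, then drop attributes that still
# contain missing values.
numeric = df.select_dtypes(include=[np.number])
clean = numeric.dropna(axis=1)
print(list(clean.columns))  # ['Zuverlässigkeit']
```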

2. General statistical description - Funktionsbereich (All features after preprocessing)

In this section, a statistical description of the entire subset of data with ordinal attributes is presented. In this part of the analysis, the absolute frequencies of the variables are presented in the form of a contingency table. The Chi2 test is also applied to see the relationship between each pair of attributes. For a better understanding, two bar charts (absolute and relative frequencies) per attribute are shown. At the end of the section, the correlation between attributes is shown by applying Kendall's τ coefficient (this coefficient is used instead of Pearson's or Spearman's, as it gives better and more accurate results when working with ordinal attributes).

Distribution of ordinal attributes - Application areas

Summary statistics (percentiles, mean and standard deviation) of the numeric values of the DataFrame, per attribute.
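These per-attribute summary statistics come directly from pandas; a minimal sketch with illustrative ratings:

```python
import pandas as pd

# Illustrative 1-5 ratings for two attributes.
df = pd.DataFrame({"Baumaßnahmen": [1, 3, 3, 5],
                   "Zuverlässigkeit": [2, 4, 4, 4]})

# count, mean, std, min, percentiles and max per attribute.
summary = df.describe()
print(summary.loc["mean"])
```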

2.1 Contingency tables & Chi2 test

Contingency tables

Chi2 Test
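For one attribute pair, the contingency table and Chi2 test of independence can be sketched like this (ratings are made up for illustration):

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical ratings for two attributes.
df = pd.DataFrame({
    "Baumaßnahmen":       [1, 1, 3, 3, 5, 5, 1, 3],
    "Programmieraufwand": [2, 2, 4, 4, 2, 4, 2, 4],
})

# Contingency table of absolute frequencies.
table = pd.crosstab(df["Baumaßnahmen"], df["Programmieraufwand"])

# Chi2 test: a small p-value suggests the attributes are related.
chi2, p, dof, expected = chi2_contingency(table)
print(round(p, 3))
```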

2.1.1 Frequency diagrams per feature (attribute)

Baumaßnahmen

Programmieraufwand

Explizite Unterstützung

Nachrüstbarkeit

Instandhaltung

Gewährleistung/Garantie

Verfügbarkeit von Ansprechpartnern/Anlaufstellen

Beschaffung (Zugang zu Produkten)

Zukunftspotential

Erweiterbarkeit von Dienstleistungen

Zuverlässigkeit

Herstelleranzahl (bzw. Umsatz)

Technologiealter

Energieersparnis

Geräteauswahl

Reichweite in Gebäuden

Störanfälligkeit

Erweiterbarkeit von Systemen

Datensensibilität

Datenlokalität

Baumaßnahmen

Zuschüsse

Förderkredite

2.2. Correlation between features

Given the non-linearity of the data together with the type of data we have (ordinal attributes), our options are Spearman's rho or Kendall's tau. Kendall's tau is preferred because it is more robust and more efficient than Spearman's rho, especially with small samples. In general, the Kendall correlation is preferred over the Spearman correlation because of its smaller gross error sensitivity (GES), i.e. greater robustness, and its smaller asymptotic variance (AV), i.e. greater efficiency.
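In pandas this is a one-liner; a toy sketch with two inversely ordered ratings (Kendall's tau-b handles the ties that 1-5 scales produce):

```python
import pandas as pd

# Toy ordinal ratings: a perfectly inverse relationship.
df = pd.DataFrame({
    "Zuverlässigkeit":  [1, 2, 2, 4, 5],
    "Störanfälligkeit": [5, 4, 4, 2, 1],
})

# Pairwise Kendall's tau between all attributes.
corr = df.corr(method="kendall")
print(corr.loc["Zuverlässigkeit", "Störanfälligkeit"])  # -1.0
```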

The most correlated characteristics are shown here.

2.3 Correlation Erste Aufstellung vs Zweite Aufstellung

3. Statistical description per general properties - Funktionsbereich (All features)

In this section, each application area is studied to see the statistical differences depending on the selected property group (by property group we mean a classification of subsets of attributes, e.g. Installationsaufwand, Wartung und Service, ...). This analysis contains the following points per property group:

3.1. Installationsaufwand

3.2. Wartung und Service

3.3. Langlebigkeit

3.4. Technische Attribute

3.5. Datenschutz

3.6. Präferenz

3.7. Standards

3.8. Unterstützung Pflegebedürftiger

3.9. Fördermöglichkeiten

4. Attribute selection

4.1 Functions used for the analysis

4.2 Entropy
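A sketch of the idea behind the entropy score, under the assumption (consistent with the conclusion section) that attributes whose ratings spread evenly over 1-5 are the most informative; column names and the helper are illustrative:

```python
import pandas as pd
from scipy.stats import entropy

# Two illustrative attributes: one evenly spread, one constant.
df = pd.DataFrame({
    "gleichverteilt": [1, 2, 3, 4, 5],
    "konstant":       [3, 3, 3, 3, 3],
})

def attribute_entropy(col: pd.Series) -> float:
    """Shannon entropy (base 2) of one attribute's value distribution."""
    probs = col.value_counts(normalize=True)
    return entropy(probs, base=2)

# Higher entropy = better-distributed ratings = more discriminative.
scores = df.apply(attribute_entropy)
print(scores.sort_values(ascending=False))
```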

4.2.1 Entropy - All the technological products

Erste Aufstellung

Zweite Aufstellung

4.2.2 Entropy - Application area

Lichtkontrolle

Telemonitoring

Klimaregulation

Innenliegende Sicherheitssensorik

Heimgeräte

Gartensysteme

Fensterbeschattung

Voice assistants

There is just one instance.

Außenliegende Sicherheitssensorik

4.2.3 Entropy Selection

4.3 Estimator-based attribute selection

4.3.1 Random Forest
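The random forest importances can be sketched as follows on synthetic stand-in data (attribute and class names are illustrative, not the project's actual columns):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in: 1-5 ratings as features, application area as label.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.integers(1, 6, size=(60, 4)),
                 columns=["Baumaßnahmen", "Zuverlässigkeit",
                          "Technologiealter", "Energieersparnis"])
y = rng.choice(["Lichtkontrolle", "Telemonitoring"], size=60)

# Fit the forest and rank attributes by impurity-based importance.
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
importances = pd.Series(clf.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
```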

Erste Aufstellung

Name der Techniklösung

Application area

4.3.2 Random Forest - Recursive Feature elimination (RFE)

The goal of recursive feature elimination (RFE) is to select features by recursively considering smaller and smaller sets of features. First, the estimator is trained on the initial set of features and the importance of each feature is obtained, either through a specific attribute or a callable. Then, the least important features are pruned from the current set of features. That procedure is recursively repeated on the pruned set until the desired number of features to select is eventually reached.
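The procedure described above can be sketched with scikit-learn's `RFE` wrapper around a random forest; the data here is synthetic, standing in for the product table:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Synthetic data standing in for the rated attributes and labels.
X, y = make_classification(n_samples=80, n_features=8,
                           n_informative=3, random_state=0)

# Recursively drop the least important feature until 3 remain.
selector = RFE(RandomForestClassifier(n_estimators=50, random_state=0),
               n_features_to_select=3)
selector.fit(X, y)
print(selector.support_)   # boolean mask of the selected features
print(selector.ranking_)   # rank 1 = selected, higher = pruned earlier
```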

Name der Techniklösung

Application Area

Zweite Aufstellung

Name der Techniklösung

Application area

Random Forest - Recursive Feature elimination (RFE)

Application Area

5. Conclusion

In this section, two approaches for the selection of features are explained, based on the results of this analysis. The first approach aims to select the minimal set of features. The second approach uses all the features that performed best in the entropy analysis, without seeking a reduction of the set.

Minimal set of selectable features

Based on the results of the entropy analysis per application area, we obtained several promising features that best represent most of the products in terms of the distribution of the rated values. With these features in mind, we examined each application area to check whether the entropy results correspond to well-distributed columns (values from 1 to 5). Our conclusion after this review is that three characteristics best represent all products; these are:

Correlation between features

Similarity compared to results after random forest application

Looking at the results obtained after applying the random forest classifier (RFC), one might think that the above features are not the most important ones for the classification. However, this is because the RFC takes all instances into account: the features that appear most representative show a good distribution of values from 1 to 5 over the whole dataset, but not within each application area. For this reason it is not appropriate to select features based on the whole dataset; features should be selected per application area.

Extensive set of selectable features

In section 4.2.3 Entropy selection, we selected 8 features derived from entropy analysis. These features are the most representative for our objective, which is to formulate questions to assign answers to technological products. The features are listed below.

Correlation between features

Similarity compared to results after random forest application

The same problem as in the previous section applies here.

*See section above